Anti-money laundering is a constant concern for financial institutions such as banks. In general, they want a procedure that efficiently flags potentially "strange" customer behaviors (so that a follow-up phone check can be made later) by tracking the important information in customers' transactions, while trying not to bother the clients too much.
Our goal is to detect potential money-laundering behavior in the dataset we have. Before diving into the analysis, we need to think about the meaning of each variable, and we may need to transform some of them to make them useful. In the first stage of modeling, one approach is to cluster the customers based on their transaction variables and then explore the points in each cluster. This gives us an initial sense of the identities or types of customers based on their transaction behavior, and at the same time we should try to interpret each cluster and think about what is really happening there. Finding a more interpretable representation may also guide us toward better clustering algorithms.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(TDA, rgl,data.table,DT,knitr,plotly)
#knit_hooks$set(webgl = hook_webgl)
options(scipen = 9999, warn = -1, digits= 4)
For this anti-money laundering data analysis project, here is the dataset we are working with, provided by Scotia AML specialists.
We have 100,000 clients. Each client has eight transaction types reflecting account behavior, recorded over six months.
Eight transaction types:
Six months of records:
The data set was constructed deliberately and is well arranged, so we do not need to do any messy data cleaning. However, to characterize a client, we may need to further process each client's attributes.
Here is the data structure of the information for one client.
The idea here is to derive summary quantities for each variable (column) from its values over the 6 months.
datatable(lookUpID(3,Mall),class="table-condensed", options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
scrollX = TRUE,
pageLength = 5,
lengthMenu = c(5, 10, 15, 20),
caption="Client ID = 3"
))
For example, I computed the sample mean and sample standard deviation of each transaction type across the six months. The summarized data set is shown below.
summaryTransform <- function(x){
  # Column 1 is the client ID; the remaining 48 columns hold the
  # 8 transaction types for each of the 6 months.
  indexSeq <- 1 + seq(0, 40, by = 8)   # offsets of the 6 monthly blocks
  stats <- sapply(1:8, function(type) {
    vals <- x[type + indexSeq]         # the 6 monthly values for this type
    c(mean(vals), sd(vals))
  })
  # Flatten column-wise: mean1, sd1, mean2, sd2, ..., mean8, sd8
  c(x[1], as.vector(stats))
}
transMall <- apply(Mall,MARGIN = 1,summaryTransform)
transMall <- t(transMall)
colnames(transMall) <- c("ID",
paste(c("m","s"),rep(c(1:8),each=2),sep = ""))
#str(transMall)
datatable(head(transMall, 50),class="table-condensed", options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
scrollX = TRUE,
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
A quick summary of each column is shown below. Note that s7 is all zeros. This is reasonable because clients pay the same mortgage amount every month.
summary(transMall[,-1])
##         m1               s1              m2             s2
## Min. : 1 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 113 1st Qu.: 9 1st Qu.: 0 1st Qu.: 0
## Median : 345 Median : 38 Median : 0 Median : 0
## Mean : 23337 Mean : 3236 Mean : 635 Mean :118
## 3rd Qu.: 8574 3rd Qu.: 4941 3rd Qu.: 442 3rd Qu.:272
## Max. :462891 Max. :63332 Max. :5662 Max. :890
## m3 s3 m4 s4
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 50 1st Qu.: 2.2 1st Qu.: 0 1st Qu.: 0
## Median : 309 Median : 6.1 Median : 53 Median : 5
## Mean : 869 Mean : 182.7 Mean : 2402 Mean : 452
## 3rd Qu.: 764 3rd Qu.: 291.8 3rd Qu.: 3531 3rd Qu.: 783
## Max. :8865 Max. :1645.8 Max. :44293 Max. :5533
## m5 s5 m6 s6
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 210 1st Qu.: 0
## Median : 53 Median : 5 Median : 386 Median : 38
## Mean : 2401 Mean : 453 Mean : 609 Mean : 217
## 3rd Qu.: 3560 3rd Qu.: 779 3rd Qu.: 607 3rd Qu.: 73
## Max. :37874 Max. :6370 Max. :5267 Max. :3598
## m7 s7 m8 s8
## Min. : 0 Min. :0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.:0 1st Qu.: 0 1st Qu.: 0
## Median :1933 Median :0 Median : 685 Median : 0
## Mean :1917 Mean :0 Mean : 1765 Mean : 153
## 3rd Qu.:2580 3rd Qu.:0 3rd Qu.: 3077 3rd Qu.: 281
## Max. :8548 Max. :0 Max. :17545 Max. :2308
# Remove id and s7
transMallsub <- scale(transMall[,-c(1,15)])
PCAMall <- princomp(transMallsub)
plot(PCAMall) #Show variations explained by each PC
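Beyond the scree plot, the cumulative proportion of variance explained tells us how many PCs are needed to cover a given share of the total variance. A small sketch, using simulated data in place of `transMallsub` (which is assumed to exist in the session):

```r
# Sketch: cumulative variance explained by principal components,
# shown on simulated data standing in for transMallsub.
set.seed(1)
X <- scale(matrix(rnorm(200 * 5), ncol = 5))
pc <- princomp(X)
cumVar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
round(cumVar, 3)          # running share of variance
which(cumVar >= 0.8)[1]   # fewest PCs covering 80% of the variance
```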
# Extract the loadings of the 15 PCs
PCloading <- PCAMall$loadings[,1:15]
datatable(round(PCloading,2),class="table-condensed", options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
scrollX = TRUE,
pageLength = 15,
lengthMenu = c(15,5,10)
))
pcaX <- transMallsub %*% PCAMall$loadings
Below is a 3-dimensional plot of PC1, PC2 and PC3. Visual inspection suggests six distinct clusters. In particular, there is one small cluster that looks suspicious; we can run further analysis on the clients it contains.
lab <- paste("PC",
1:3,
"(",
round(PCAMall$sdev[1:3]^2/sum( PCAMall$sdev^2)*100,2),
"%)",
sep="")
colSumPayroll <- rowSums(abs(Mall[,9+seq(0,40,8)]))
colIndex <- 1+as.numeric(colSumPayroll==0)
colIndex <- ifelse(colIndex==1,"black","red")
plot3d(pcaX[,1:3],
xlab=lab[1],ylab=lab[2],zlab=lab[3],
col=colIndex,alpha=0.3)
# Note: colIndex is "black" when payroll is nonzero, "red" when it is zero
legend3d("topright", legend = c("Pay","No Pay"), pch=c(16,16),
col = c("black","red"), inset=c(0.02),cex=1.2)
rglwidget()
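Once clusters are visible in PC space, a clustering algorithm such as k-means on the leading PC scores can assign each client to a cluster, so the smallest (potentially suspicious) one can be pulled out for review. A sketch under the assumption that `pcaX` holds the PC scores; here a toy score matrix with two well-separated groups stands in for it:

```r
# Sketch: k-means on the first PC scores to isolate the smallest cluster.
# toyScores stands in for pcaX[, 1:3]; the second group is small and far away.
set.seed(42)
toyScores <- rbind(matrix(rnorm(300, mean = 0), ncol = 3),
                   matrix(rnorm(30,  mean = 8), ncol = 3))
km <- kmeans(toyScores, centers = 2, nstart = 20)
smallCluster <- which.min(table(km$cluster))
suspectRows  <- which(km$cluster == smallCluster)  # row indices = client positions
length(suspectRows)   # size of the smallest cluster
```

With the real data, `centers` would be set to the six clusters seen in the 3-D plot, and `suspectRows` would index back into the client IDs for follow-up checks.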